CPSC 330 Lecture 9: Classification Metrics

Varada Kolhatkar

Focus on the breath!

Announcements

  • Important information about midterm 1
    • https://piazza.com/class/mekbcze4gyber/post/162
    • Good news for you: You’ll have access to our course notes in the midterm!
  • HW4 was due on Monday, Oct 6th 11:59 pm.
  • HW5 has been released. It’s a project-type assignment and you get till Oct 27th to work on it.

ML workflow

Accuracy

  • So far, we’ve been measuring model performance using Accuracy.
  • Accuracy is the proportion of all predictions that were correct — whether positive or negative.

\[ \text{Accuracy} = \frac{\text{correct classifications}}{\text{total classifications}} \]

  • But is accuracy always the right metric to evaluate a model? 🤔
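As a minimal sketch of the formula above, accuracy can be computed by counting matching labels. The function name `accuracy` and the label lists are hypothetical and are not taken from the fraud dataset.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels for illustration: 1 = fraud, 0 = non-fraud.
y_true = [0, 0, 1, 0, 1, 0, 0, 0]
y_pred = [0, 0, 1, 0, 0, 0, 1, 0]

acc = accuracy(y_true, y_pred)
print(acc)  # 6 of the 8 predictions are correct -> 0.75
```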

A fraud classification example

Class Time Amount V1 V2 V3 V4 V5 V6 V7 ... V19 V20 V21 V22 V23 V24 V25 V26 V27 V28
64454 0 51150.0 1.00 -3.538816 3.481893 -1.827130 -0.573050 2.644106 -0.340988 2.102135 ... -1.509991 1.345904 0.530978 -0.860677 -0.201810 -1.719747 0.729143 -0.547993 -0.023636 -0.454966
37906 0 39163.0 18.49 -0.363913 0.853399 1.648195 1.118934 0.100882 0.423852 0.472790 ... 0.810267 -0.192932 0.687055 -0.094586 0.121531 0.146830 -0.944092 -0.558564 -0.186814 -0.257103
79378 0 57994.0 23.74 1.193021 -0.136714 0.622612 0.780864 -0.823511 -0.706444 -0.206073 ... 0.258815 -0.178761 -0.310405 -0.842028 0.085477 0.366005 0.254443 0.290002 -0.036764 0.015039
245686 0 152859.0 156.52 1.604032 -0.808208 -1.594982 0.200475 0.502985 0.832370 -0.034071 ... -1.009429 -0.040448 0.519029 1.429217 -0.139322 -1.293663 0.037785 0.061206 0.005387 -0.057296
60943 0 49575.0 57.50 -2.669614 -2.734385 0.662450 -0.059077 3.346850 -2.549682 -1.430571 ... 0.157993 -0.430295 -0.228329 -0.370643 -0.211544 -0.300837 -1.174590 0.573818 0.388023 0.161782

5 rows × 31 columns

DummyClassifier

Let’s try a DummyClassifier, which makes predictions without learning any patterns.

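import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

# X_train and y_train are assumed to be the fraud training split defined earlier.
dummy = DummyClassifier()
pd.DataFrame(cross_validate(dummy, X_train, y_train, return_train_score=True)).mean()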
fit_time       0.018004
score_time     0.003109
test_score     0.998300
train_score    0.998300
dtype: float64
  • The accuracy looks surprisingly high!
  • Should we be happy with this model and deploy it?

Problem: Class imbalance

  • In many real-world problems, some classes are much rarer than others.

  • A model that always predicts “no fraud” could still achieve >99% accuracy!

  • This is why accuracy can be misleading in imbalanced datasets.

  • We need metrics that differentiate types of errors.
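The point above can be illustrated with a small synthetic example. The class counts below are made up for illustration (they are not the counts of the fraud dataset), and the "model" simply predicts the majority class for every transaction.

```python
# Synthetic, hypothetical data: 1,000 transactions, 2 of them fraud (label 1).
y_true = [0] * 998 + [1] * 2

# A "model" that ignores its input and always predicts "no fraud".
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
frauds_caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(f"accuracy: {accuracy:.3f}")         # accuracy: 0.998
print(f"frauds caught: {frauds_caught}")   # frauds caught: 0
```

The accuracy is very high even though not a single fraud case is detected, which is why the error types need to be looked at separately.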

Fraud confusion matrix

Which types of errors would be most critical for the bank to address?

  • Missing a fraud case?

  • Or flagging a legitimate transaction as fraud?

Understanding the confusion matrix

  • TN \(\rightarrow\) True negatives
  • FP \(\rightarrow\) False positives
  • FN \(\rightarrow\) False negatives
  • TP \(\rightarrow\) True positives
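The four cells can be counted directly from the true and predicted labels. This is a minimal sketch that assumes binary labels with 1 as the positive class; `confusion_counts` is a hypothetical helper. In scikit-learn, `sklearn.metrics.confusion_matrix` returns the same counts laid out as `[[TN, FP], [FN, TP]]` for labels 0 and 1.

```python
def confusion_counts(y_true, y_pred):
    """Return (TN, FP, FN, TP) for binary labels, where 1 is the positive class."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 0 and p == 0:
            tn += 1  # correctly predicted negative
        elif t == 0 and p == 1:
            fp += 1  # negative example flagged as positive
        elif t == 1 and p == 0:
            fn += 1  # positive example that was missed
        elif t == 1 and p == 1:
            tp += 1  # correctly predicted positive
        else:
            raise ValueError("labels must be 0 or 1")
    return tn, fp, fn, tp


# Hypothetical labels for illustration: 1 = fraud, 0 = non-fraud.
y_true = [0, 0, 1, 0, 1, 0, 0, 0]
y_pred = [0, 0, 1, 0, 0, 0, 1, 0]

counts = confusion_counts(y_true, y_pred)
print(counts)  # (5, 1, 1, 1)
```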

Practice: confusion matrix terminology

Confusion matrix questions

Imagine a spam filter model where emails are labeled 1 = spam, 0 = not spam. If a spam email is incorrectly classified as not spam, what kind of error is this?

    1. A false positive
    2. A true positive
    3. A false negative
    4. A true negative

Confusion matrix questions

In an intrusion detection system, 1 = intrusion, 0 = safe. If the system misses an actual intrusion and classifies it as safe, this is a:

    1. A false positive
    2. A true positive
    3. A false negative
    4. A true negative

Confusion matrix questions

In a medical test for a disease, 1 = diseased, 0 = healthy. If a healthy patient is incorrectly diagnosed as diseased, that’s a:

    1. A false positive
    2. A true positive
    3. A false negative
    4. A true negative

Now that we understand the different types of errors, we can explore metrics that better capture model performance when accuracy falls short.

Precision, Recall, F1-Score
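
For reference, the standard definitions of these metrics in terms of the confusion-matrix counts TP, FP, and FN (the slide itself only names them):

\[ \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]

Precision asks how many of the predicted positives were truly positive; recall asks how many of the actual positives were found; the F1-score is their harmonic mean.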

iClicker Exercise 9.1

Select all of the following statements which are TRUE.

    1. In medical diagnosis, false positives are more damaging than false negatives (assume “positive” means the person has a disease, “negative” means they don’t).
    2. In spam classification, false positives are more damaging than false negatives (assume “positive” means the email is spam, “negative” means it’s not).
    3. If method A gets a higher accuracy than method B, that means its precision is also higher.
    4. If method A gets a higher accuracy than method B, that means its recall is also higher.

Counterexamples

Method A - higher accuracy but lower precision

Negative Positive
90 5
5 0

Method B - lower accuracy but higher precision

Negative Positive
80 15
0 5
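
The two matrices can be checked numerically. This sketch assumes the scikit-learn layout, which the slide does not state explicitly: rows are actual classes, columns are predicted classes, so each matrix reads `[[TN, FP], [FN, TP]]`. The helper `metrics` is hypothetical, and returning 0.0 when a denominator is zero is an assumed convention (it is not triggered here).

```python
def metrics(tn, fp, fn, tp):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    # Assumed convention: report 0.0 when the denominator would be zero.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return accuracy, precision, recall


# Assumed layout: rows = actual class, columns = predicted class.
method_a = metrics(tn=90, fp=5, fn=5, tp=0)
method_b = metrics(tn=80, fp=15, fn=0, tp=5)

print(method_a)  # (0.9, 0.0, 0.0)
print(method_b)  # (0.85, 0.25, 1.0)
```

Under that assumption, method A has the higher accuracy (0.90 vs. 0.85), while method B has both the higher precision and the higher recall.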

Thresholding

  • The above metrics assume a fixed threshold.

  • We use thresholding to get the binary prediction.

  • A typical threshold is 0.5.

    • A prediction of 0.90 \(\rightarrow\) a high likelihood that the transaction is fraudulent and we predict fraud
    • A prediction of 0.20 \(\rightarrow\) a low likelihood that the transaction is fraudulent, so we predict non-fraud
  • What happens if the predicted score is equal to the chosen threshold?

  • Play with classification thresholds
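
Thresholding can be sketched in a few lines. The scores below are hypothetical, and `apply_threshold` is an illustrative helper. How a score exactly equal to the threshold is handled is a matter of convention; this sketch assumes such a tie counts as positive (`>=`).

```python
def apply_threshold(scores, threshold=0.5):
    """Predict 1 (fraud) when the score is at least the threshold, else 0.

    Treating a score equal to the threshold as positive is an assumed convention.
    """
    return [1 if s >= threshold else 0 for s in scores]


scores = [0.90, 0.20, 0.50, 0.49]  # hypothetical predicted scores

preds_default = apply_threshold(scores)                # threshold 0.5
preds_strict = apply_threshold(scores, threshold=0.8)  # a higher threshold

print(preds_default)  # [1, 0, 1, 0]
print(preds_strict)   # [1, 0, 0, 0]
```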

iClicker Exercise 9.2

Select all of the following statements which are TRUE.

    1. If we increase the classification threshold, both true and false positives are likely to decrease.
    2. If we increase the classification threshold, both true and false negatives are likely to decrease.
    3. Lowering the classification threshold generally increases the model’s recall.
    4. Raising the classification threshold can improve the precision of the model if it effectively reduces the number of false positives without significantly affecting true positives.

PR curve

  • Calculate precision and recall (TPR) at every possible threshold and graph them.
  • A better choice than the ROC curve for highly imbalanced datasets.
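
A rough sketch of how the points of a PR curve can be computed by hand: use each distinct score as a threshold and record precision and recall. The data and the helper `pr_points` are hypothetical, and scores equal to the threshold are assumed to count as positive. In practice, scikit-learn's `sklearn.metrics.precision_recall_curve` does this computation.

```python
def pr_points(y_true, scores):
    """Return a list of (threshold, precision, recall), one per distinct score."""
    n_pos = sum(1 for t in y_true if t == 1)
    if n_pos == 0:
        raise ValueError("recall is undefined without positive examples")
    points = []
    for thr in sorted(set(scores), reverse=True):
        # Everything scoring at or above the threshold is predicted positive.
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        precision = tp / (tp + fp)  # tp + fp >= 1 because thr is one of the scores
        recall = tp / n_pos
        points.append((thr, precision, recall))
    return points


# Hypothetical labels and scores for illustration.
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.1]

points = pr_points(y_true, scores)
for thr, precision, recall in points:
    print(f"threshold={thr:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```

Lowering the threshold never decreases recall, while precision can move in either direction.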

ROC curve

  • Calculate the true positive rate (TPR) and false positive rate (FPR) at every possible threshold and plot TPR against FPR.
  • Good choice when the datasets are roughly balanced.
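
The same threshold sweep gives the points of a ROC curve. This is a sketch under the same assumptions as before (hypothetical data, a hypothetical helper `roc_points`, ties with the threshold counted as positive); scikit-learn provides `sklearn.metrics.roc_curve` for real use.

```python
def roc_points(y_true, scores):
    """Return a list of (FPR, TPR) points, starting from (0.0, 0.0)."""
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical labels and scores for illustration.
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.1]

points = roc_points(y_true, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```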

AUC

  • The area under the ROC curve (AUC) represents the probability that the model, if given a randomly chosen positive and negative example, will rank the positive higher than the negative.
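
The ranking interpretation above can be computed directly by comparing every positive example with every negative example. This is a minimal sketch on hypothetical data; `auc_by_ranking` is an illustrative helper, and counting tied scores as half a win is a common convention assumed here. For real use, scikit-learn offers `sklearn.metrics.roc_auc_score`.

```python
def auc_by_ranking(y_true, scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # assumed convention: a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical labels and scores for illustration.
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.1]

auc = auc_by_ranking(y_true, scores)
print(f"{auc:.3f}")  # 5 of the 6 positive/negative pairs are ranked correctly -> 0.833
```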

ROC AUC questions

Consider the points A, B, and C in the following diagram, each representing a threshold. Which threshold would you pick in each scenario?

    1. If false positives (false alarms) are highly costly
    2. If false positives are cheap and false negatives (missed positives) are highly costly
    3. If the costs are roughly equivalent
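
One way to reason about these scenarios is to attach a cost to each error type and pick the threshold with the lowest total cost. The sketch below is only an illustration: the data, the cost values, and the helper `best_threshold` are hypothetical, and it does not refer to the points A, B, and C in the diagram.

```python
def best_threshold(y_true, scores, fp_cost, fn_cost):
    """Return (threshold, total_cost) minimizing fp_cost * FP + fn_cost * FN.

    Candidate thresholds are the distinct scores; on ties in cost,
    the lowest such threshold is kept.
    """
    best = None
    for thr in sorted(set(scores)):
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        fn = sum(1 for t, s in zip(y_true, scores) if s < thr and t == 1)
        cost = fp_cost * fp + fn_cost * fn
        if best is None or cost < best[1]:
            best = (thr, cost)
    return best


# Hypothetical labels and scores for illustration.
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.1]

# False positives ten times as costly as false negatives: a high threshold wins.
fp_costly = best_threshold(y_true, scores, fp_cost=10, fn_cost=1)
# False negatives ten times as costly as false positives: a lower threshold wins.
fn_costly = best_threshold(y_true, scores, fp_cost=1, fn_cost=10)

print(fp_costly)  # (0.9, 1)
print(fn_costly)  # (0.6, 1)
```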
